Lecture 8:
Basic Data Analysis with R
2024-11-21
Long and wide data. Source: Hugo Tavares
Consider the following data frame schwiizerChuchi. This dataset records the popularity ratings (on a scale of 1 to 10) of various Swiss dishes in different regions of Switzerland:
Which of the following statements is true?
nrow(schwiizerChuchiLong) == 12 returns TRUEdim(schwiizerChuchiLong) returns c(3, 12)dim(schwiizerChuchi) returns c(3, 12)mean(schwiizerChuchiLong$Raclette) == 8.333Why is this data frame not tidy, and what would you do to make it tidy? Write down your reasoning in numbered steps. You can write down some exact code, some higher-level code concepts, or in plain text.
Why is this data frame not tidy, and what would you do to make it tidy? Write down your reasoning in numbered steps. You can write down some exact code, some higher-level code concepts, or in plain text.
dplyr.Join setup. Source: R4DS.
Join setup. Source: R4DS.
Inner join. Source: R4DS
Left join. Source: R4DS.
Right join. Source: R4DS.
Full join. Source: R4DS.
Join Venn Diagramm. Source: R4DS.
# load packages
library(tidyverse)
# initiate data frame on persons personal spending
df_c <- data.frame(id = c(1:3,1:3),
money_spent= c(1000, 2000, 6000, 1500, 3000, 5500),
currency = c("CHF", "CHF", "USD", "EUR", "CHF", "USD"),
year=c(2017,2017,2017,2018,2018,2018))
df_c id money_spent currency year
1 1 1000 CHF 2017
2 2 2000 CHF 2017
3 3 6000 USD 2017
4 1 1500 EUR 2018
5 2 3000 CHF 2018
6 3 5500 USD 2018
# initiate data frame on persons' characteristics
df_p <- data.frame(id = 1:4,
first_name = c("Anna", "Betty", "Claire", "Diane"),
profession = c("Economist", "Data Scientist",
"Data Scientist", "Economist"))
df_p id first_name profession
1 1 Anna Economist
2 2 Betty Data Scientist
3 3 Claire Data Scientist
4 4 Diane Economist
id first_name profession money_spent currency year
1 1 Anna Economist 1000 CHF 2017
2 1 Anna Economist 1500 EUR 2018
3 2 Betty Data Scientist 2000 CHF 2017
4 2 Betty Data Scientist 3000 CHF 2018
5 3 Claire Data Scientist 6000 USD 2017
6 3 Claire Data Scientist 5500 USD 2018
7 4 Diane Economist NA <NA> NA
Overview by R4DS:
| dplyr (tidyverse) | base::merge |
|---|---|
inner_join(x, y) |
merge(x, y) |
left_join(x, y) |
merge(x, y, all.x = TRUE) |
right_join(x, y) |
merge(x, y, all.y = TRUE), |
full_join(x, y) |
merge(x, y, all = TRUE) |
Source: https://www.storybench.org/wp-content/uploads/2017/05/tidyverse.png
select, filter, arrange, mutate are the building blocks of dplyrpipelinepiping %>% operator is to chain one function after another without the need to assign intermediate variables.|>.pipeline with dplyr# Traditional way
mydf <- data(swiss)
mydf <- arrange(mydf, -Catholic)
mydf <- filter(mydf, Education > 8 & Catholic > 90)
mydf <- mutate(mydf, Country = "Switzerland")
mydf <- select(mydf, Examination)
# The pipe way
mydf <- data(swiss) |>
arrange(-Catholic) |>
filter(Education > 8 & Catholic > 90) |>
mutate(Country = "Switzerland") |>
select(Examination)dplyrforecats to deal with factors;lubridate to deal with dates;stringr to deal with strings and regular expressions.Source: R4DS, “Statistics for Public Policy” by Jeremy G. Weber, 2024.
Rsummary()table()skimr, summarytools, janitor (also cleaning)mean()).swiss data# Load the built-in 'swiss' dataset
data("swiss")
swiss <- rownames_to_column(swiss, var = "municipality")
# Add an outlier in Lausanne
swiss[swiss$municipality == "Lausanne", "Infant.Mortality"] <- 100
# Add a NA for Agriculture in Gruyere
swiss[swiss$municipality == "Gruyere", "Agriculture"] <- NA municipality Fertility Agriculture Examination Education Catholic Infant.Mortality
1 Courtelary 80.2 17.0 15 12 9.96 22.2
2 Delemont 83.1 45.1 6 9 84.84 22.2
3 Franches-Mnt 92.5 39.7 5 5 93.40 20.2
4 Moutier 85.8 36.5 12 7 33.77 20.3
5 Neuveville 76.9 43.5 17 15 5.16 20.6
6 Porrentruy 76.1 35.3 9 7 90.57 26.6
7 Broye 83.8 70.2 16 7 92.85 23.6
8 Glane 92.4 67.8 14 8 97.16 24.9
9 Gruyere 82.4 NA 12 7 97.67 21.0
10 Sarine 82.9 45.2 16 13 91.38 24.4
11 Veveyse 87.1 64.5 14 6 98.61 24.5
12 Aigle 64.1 62.0 21 12 8.52 16.5
13 Aubonne 66.9 67.5 14 7 2.27 19.1
14 Avenches 68.9 60.7 19 12 4.43 22.7
15 Cossonay 61.7 69.3 22 5 2.82 18.7
16 Echallens 68.3 72.6 18 2 24.20 21.2
17 Grandson 71.7 34.0 17 8 3.30 20.0
18 Lausanne 55.7 19.4 26 28 12.11 100.0
19 La Vallee 54.3 15.2 31 20 2.15 10.8
20 Lavaux 65.1 73.0 19 9 2.84 20.0
21 Morges 65.5 59.8 22 10 5.23 18.0
22 Moudon 65.0 55.1 14 3 4.52 22.4
23 Nyone 56.6 50.9 22 12 15.14 16.7
24 Orbe 57.4 54.1 20 6 4.20 15.3
25 Oron 72.5 71.2 12 1 2.40 21.0
26 Payerne 74.2 58.1 14 8 5.23 23.8
27 Paysd'enhaut 72.0 63.5 6 3 2.56 18.0
28 Rolle 60.5 60.8 16 10 7.72 16.3
29 Vevey 58.3 26.8 25 19 18.46 20.9
30 Yverdon 65.4 49.5 15 8 6.10 22.5
31 Conthey 75.5 85.9 3 2 99.71 15.1
32 Entremont 69.3 84.9 7 6 99.68 19.8
33 Herens 77.3 89.7 5 2 100.00 18.3
34 Martigwy 70.5 78.2 12 6 98.96 19.4
35 Monthey 79.4 64.9 7 3 98.22 20.2
36 St Maurice 65.0 75.9 9 9 99.06 17.8
37 Sierre 92.2 84.6 3 3 99.46 16.3
38 Sion 79.3 63.1 13 13 96.83 18.1
39 Boudry 70.4 38.4 26 12 5.62 20.3
40 La Chauxdfnd 65.7 7.7 29 11 13.79 20.5
41 Le Locle 72.7 16.7 22 13 11.22 18.9
42 Neuchatel 64.4 17.6 35 32 16.92 23.0
43 Val de Ruz 77.6 37.6 15 7 4.97 20.0
44 ValdeTravers 67.6 18.7 25 7 8.65 19.5
45 V. De Geneve 35.0 1.2 37 53 42.34 18.0
46 Rive Droite 44.7 46.6 16 29 50.43 18.2
47 Rive Gauche 42.8 27.7 22 29 58.33 19.3
swiss data municipality Fertility Agriculture Examination Education
Length:47 Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
Class :character 1st Qu.:64.70 1st Qu.:35.60 1st Qu.:12.00 1st Qu.: 6.00
Mode :character Median :70.40 Median :54.60 Median :16.00 Median : 8.00
Mean :70.14 Mean :50.60 Mean :16.49 Mean :10.98
3rd Qu.:78.45 3rd Qu.:67.72 3rd Qu.:22.00 3rd Qu.:12.00
Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
NA's :1
Catholic Infant.Mortality
Min. : 2.150 Min. : 10.80
1st Qu.: 5.195 1st Qu.: 18.15
Median : 15.140 Median : 20.00
Mean : 41.144 Mean : 21.64
3rd Qu.: 93.125 3rd Qu.: 22.20
Max. :100.000 Max. :100.00
swiss data| Name | swiss |
| Number of rows | 47 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| municipality | 0 | 1 | 4 | 12 | 0 | 47 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Fertility | 0 | 1.00 | 70.14 | 12.49 | 35.00 | 64.70 | 70.40 | 78.45 | 92.5 | ▂▂▇▇▅ |
| Agriculture | 1 | 0.98 | 50.60 | 22.96 | 1.20 | 35.60 | 54.60 | 67.72 | 89.7 | ▃▃▅▇▅ |
| Examination | 0 | 1.00 | 16.49 | 7.98 | 3.00 | 12.00 | 16.00 | 22.00 | 37.0 | ▅▇▆▂▂ |
| Education | 0 | 1.00 | 10.98 | 9.62 | 1.00 | 6.00 | 8.00 | 12.00 | 53.0 | ▇▃▁▁▁ |
| Catholic | 0 | 1.00 | 41.14 | 41.70 | 2.15 | 5.20 | 15.14 | 93.12 | 100.0 | ▇▁▁▁▅ |
| Infant.Mortality | 0 | 1.00 | 21.64 | 12.04 | 10.80 | 18.15 | 20.00 | 22.20 | 100.0 | ▇▁▁▁▁ |
swiss data# A tibble: 2 × 2
`Catholic > 50` `mean(Fertility)`
<lgl> <dbl>
1 FALSE 66.2
2 TRUE 76.5
swiss |>
group_by(Catholic > 50) |>
summarize(across(.cols = c(Fertility, Education),
.fns = list("min" = min, "mean" = mean, "max" = max))) # A tibble: 2 × 7
`Catholic > 50` Fertility_min Fertility_mean Fertility_max Education_min Education_mean
<lgl> <dbl> <dbl> <dbl> <int> <dbl>
1 FALSE 35 66.2 85.8 1 12.1
2 TRUE 42.8 76.5 92.5 2 9.11
# ℹ 1 more variable: Education_max <int>
Use what we just saw in the lecture to solve the following problem. You have the following dataset: